主要跟踪器基于先前的预测或初始边界框作为模型输入(即搜索区域)生成固定尺寸的矩形区域。尽管这种方式导致了提高的跟踪效率,但固定尺寸的搜索区域缺乏灵活性,并且在情况下可能会失败,例如快速运动和干扰物干扰。由于搜索区域有限,跟踪器往往会丢失目标对象,或者由于搜索区域过多而受到干扰因素的干扰。在这项工作中,我们提出了一个新颖的跟踪范式,称为搜索区域调节跟踪(SRRT),该范式应用了建议的搜索区域调节器,以动态地估算每个帧的最佳搜索区域。为了调整对象在跟踪过程中的外观变化,我们进一步提出了锁定状态确定的更新策略以进行参考框架更新。我们的SRRT框架在没有精美设计的情况下非常简洁,但在七个具有挑战性的基准方面,与其他最先进的跟踪器有关基线的改进和竞争成果明显。在大规模的Lasot基准测试中,我们的SRRT改善了siamrpn ++和Transt,其绝对增长为4.6%和3.1%。
translated by 谷歌翻译
实例对象检测在智能监视,视觉导航,人机交互,智能服务和其他字段中扮演重要作用。灵感来自深度卷积神经网络(DCNN)的巨大成功,基于DCNN的实例对象检测已成为一个有前途的研究主题。为了解决DCNN始终需要大规模注释数据集来监督其培训的问题,而手动注释是耗尽和耗时的,我们提出了一种基于共同训练的新框架,称为克自我标记和检测(Gram-SLD) 。建议的克拉姆-SLD可以自动注释大量数据,具有非常有限的手动标记的关键数据并实现竞争性能。在我们的框架中,克朗损失被定义并用于构造两个完全冗余和独立的视图和一个关键的样本选择策略以及自动注释策略,可以全面考虑精度并回忆,以产生高质量的伪标签。 Public Gmu厨房数据集的实验,活动视觉数据集和自制的Bhid-Item DataSetDemonstrite,只有5%的标记训练数据,我们的克斯LLD比较了对象检测中的竞争性能(少于2%的地图丢失)通过完全监督的方法。在具有复杂和变化环境的实际应用中,所提出的方法可以满足实例对象检测的实时和准确性要求。
translated by 谷歌翻译
Correlation acts as a critical role in the tracking field, especially in recent popular Siamese-based trackers. The correlation operation is a simple fusion manner to consider the similarity between the template and the search re-
translated by 谷歌翻译
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
translated by 谷歌翻译
Nowadays, time-stamped web documents related to a general news query floods spread throughout the Internet, and timeline summarization targets concisely summarizing the evolution trajectory of events along the timeline. Unlike traditional document summarization, timeline summarization needs to model the time series information of the input events and summarize important events in chronological order. To tackle this challenge, in this paper, we propose a Unified Timeline Summarizer (UTS) that can generate abstractive and extractive timeline summaries in time order. Concretely, in the encoder part, we propose a graph-based event encoder that relates multiple events according to their content dependency and learns a global representation of each event. In the decoder part, to ensure the chronological order of the abstractive summary, we propose to extract the feature of event-level attention in its generation process with sequential information remained and use it to simulate the evolutionary attention of the ground truth summary. The event-level attention can also be used to assist in extracting summary, where the extracted summary also comes in time sequence. We augment the previous Chinese large-scale timeline summarization dataset and collect a new English timeline dataset. Extensive experiments conducted on these datasets and on the out-of-domain Timeline 17 dataset show that UTS achieves state-of-the-art performance in terms of both automatic and human evaluations.
translated by 谷歌翻译
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL) yet been criticized for learning inefficiency. We believe the insufficient utilization of training signals should be responsible. To alleviate this issue, we introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD). For disjoint masking (DM), we sequentially sample multiple masked views per image in a mini-batch with the disjoint regulation to raise the usage of tokens for reconstruction in each image while keeping the masking rate of each view. For joint distillation (JD), we adopt a dual branch architecture to respectively predict invisible (masked) and visible (unmasked) tokens with superior learning targets. Rooting in orthogonal perspectives for training efficiency improvement, DM and JD cooperatively accelerate the training convergence yet not sacrificing the model generalization ability. Concretely, DM can train ViT with half of the effective training epochs (3.7 times less time-consuming) to report competitive performance. With JD, our DMJD clearly improves the linear probing classification accuracy over ConvMAE by 5.8%. On fine-grained downstream tasks like semantic segmentation, object detection, etc., our DMJD also presents superior generalization compared with state-of-the-art SSL methods. The code and model will be made public at https://github.com/mx-mark/DMJD.
translated by 谷歌翻译
Improving the visual quality of the given degraded observation by correcting exposure level is a fundamental task in the computer vision community. Existing works commonly lack adaptability towards unknown scenes because of the data-driven patterns (deep networks) and limited regularization (traditional optimization), and they usually need time-consuming inference. These two points heavily limit their practicability. In this paper, we establish a Practical Exposure Corrector (PEC) that assembles the characteristics of efficiency and performance. To be concrete, we rethink the exposure correction to provide a linear solution with exposure-sensitive compensation. Around generating the compensation, we introduce an exposure adversarial function as the key engine to fully extract valuable information from the observation. By applying the defined function, we construct a segmented shrinkage iterative scheme to generate the desired compensation. Its shrinkage nature supplies powerful support for algorithmic stability and robustness. Extensive experimental evaluations fully reveal the superiority of our proposed PEC. The code is available at https://rsliu.tech/PEC.
translated by 谷歌翻译
Diagram object detection is the key basis of practical applications such as textbook question answering. Because the diagram mainly consists of simple lines and color blocks, its visual features are sparser than those of natural images. In addition, diagrams usually express diverse knowledge, in which there are many low-frequency object categories in diagrams. These lead to the fact that traditional data-driven detection model is not suitable for diagrams. In this work, we propose a gestalt-perception transformer model for diagram object detection, which is based on an encoder-decoder architecture. Gestalt perception contains a series of laws to explain human perception, that the human visual system tends to perceive patches in an image that are similar, close or connected without abrupt directional changes as a perceptual whole object. Inspired by these thoughts, we build a gestalt-perception graph in transformer encoder, which is composed of diagram patches as nodes and the relationships between patches as edges. This graph aims to group these patches into objects via laws of similarity, proximity, and smoothness implied in these edges, so that the meaningful objects can be effectively detected. The experimental results demonstrate that the proposed GPTR achieves the best results in the diagram object detection task. Our model also obtains comparable results over the competitors in natural image object detection.
translated by 谷歌翻译
Aligning users across networks using graph representation learning has been found effective where the alignment is accomplished in a low-dimensional embedding space. Yet, achieving highly precise alignment is still challenging, especially when nodes with long-range connectivity to the labeled anchors are encountered. To alleviate this limitation, we purposefully designed WL-Align which adopts a regularized representation learning framework to learn distinctive node representations. It extends the Weisfeiler-Lehman Isormorphism Test and learns the alignment in alternating phases of "across-network Weisfeiler-Lehman relabeling" and "proximity-preserving representation learning". The across-network Weisfeiler-Lehman relabeling is achieved through iterating the anchor-based label propagation and a similarity-based hashing to exploit the known anchors' connectivity to different nodes in an efficient and robust manner. The representation learning module preserves the second-order proximity within individual networks and is regularized by the across-network Weisfeiler-Lehman hash labels. Extensive experiments on real-world and synthetic datasets have demonstrated that our proposed WL-Align outperforms the state-of-the-art methods, achieving significant performance improvements in the "exact matching" scenario. Data and code of WL-Align are available at https://github.com/ChenPengGang/WLAlignCode.
translated by 谷歌翻译
Several self-supervised representation learning methods have been proposed for reinforcement learning (RL) with rich observations. For real-world applications of RL, recovering underlying latent states is crucial, particularly when sensory inputs contain irrelevant and exogenous information. In this work, we study how information bottlenecks can be used to construct latent states efficiently in the presence of task-irrelevant information. We propose architectures that utilize variational and discrete information bottlenecks, coined as RepDIB, to learn structured factorized representations. Exploiting the expressiveness bought by factorized representations, we introduce a simple, yet effective, bottleneck that can be integrated with any existing self-supervised objective for RL. We demonstrate this across several online and offline RL benchmarks, along with a real robot arm task, where we find that compressed representations with RepDIB can lead to strong performance improvements, as the learned bottlenecks help predict only the relevant state while ignoring irrelevant information.
translated by 谷歌翻译